mtmd: add batching API#24384

Merged
ngxson merged 14 commits into
ggml-org:masterfrom
ngxson:xsn/mtmd_batch_api
Jun 12, 2026

@ngxson

@ngxson ngxson commented Jun 9, 2026

Collaborator

Overview

Supersede #24300

Also fix #24380

Add a generic batching API to mtmd and wire it up to llama-server. The goal is to speed up llava-uhd-style models and, at the same time, improve video processing speed.

Current state:

  • llama-server can use it correctly
  • the mtmd API implementation is a mock-up; the proper logic still needs to be implemented

TODO:

  • add notion of max batch size in mtmd
  • add CLI argument for it
  • mtmd_batch_add_chunk should only accept input with same size
  • wire up mtmd_batch_encode to use the 4th batch dim, added via mtmd: build_vit batching #24352
  • blacklist / whitelist models that can support it --> maybe only support build_vit() models for now
  • update mtmd-cli to use batching API --> skip, we don't actually need that

How it works

  1. create a new mtmd_batch object
  2. call mtmd_batch_add_chunk until it returns an error (either batch is full or current chunk can't be batched)
  3. call mtmd_batch_encode on the batch
  4. get the encoded embeddings via mtmd_batch_get_output_embd
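
The four steps above can be sketched as a greedy loop. This is a minimal, self-contained mock and not the mtmd API: `MockChunk`, `MockBatch`, and `plan_encoder_passes` are hypothetical stand-ins, because the real signatures of `mtmd_batch_add_chunk`, `mtmd_batch_encode`, and `mtmd_batch_get_output_embd` are not shown in this thread. The two rejection reasons (batch full, chunk cannot be batched) follow the description above; expressing capacity as a chunk count is an assumption made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an image chunk; only the fields the sketch needs.
struct MockChunk {
    int width;
    int height;
};

// Hypothetical stand-in for a batch object; NOT the real mtmd_batch.
struct MockBatch {
    std::size_t max_chunks;          // assumed capacity, counted in chunks
    std::vector<MockChunk> chunks;

    // Mirrors step 2: returns false when the batch is full or when the chunk
    // has a different size than the chunks already in the batch.
    bool add_chunk(const MockChunk & c) {
        assert(max_chunks > 0);
        if (chunks.size() >= max_chunks) {
            return false;
        }
        if (!chunks.empty() &&
            (chunks[0].width != c.width || chunks[0].height != c.height)) {
            return false;
        }
        chunks.push_back(c);
        return true;
    }
};

// Greedy driver: fill a batch until add_chunk fails, "encode" it, start over.
// Returns how many chunks went into each encoder pass.
std::vector<std::size_t> plan_encoder_passes(const std::vector<MockChunk> & input,
                                             std::size_t max_chunks) {
    std::vector<std::size_t> passes;
    std::size_t i = 0;
    while (i < input.size()) {
        MockBatch batch{max_chunks, {}};                        // step 1
        while (i < input.size() && batch.add_chunk(input[i])) { // step 2
            ++i;
        }
        // Steps 3 and 4 would encode the batch and read back embeddings here.
        passes.push_back(batch.chunks.size());
    }
    return passes;
}
```

With one 1024x1024 overview followed by three 640x640 tiles and a capacity of 2, this mock plans three passes of 1, 2, and 1 chunks: the overview cannot share a batch with the tiles, and the third tile does not fit in the second batch.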

Requirements

@sfallah

sfallah commented Jun 11, 2026

Contributor

Hi @ngxson,

I just wanted to thank you for the time and patience you put into reviewing my PRs. I have learned a lot about llama.cpp in general, but especially mtmd, through that work. I would like to use that experience to help the team.

If you would trust me with it, I would be glad to help with refactoring like #24384, and with the follow-up of migrating the existing models to the new batching API. The migration part especially feels like a good fit for what I have learned.

No pressure either way — just tell me the shape you want and I will follow it.

Also, related to this: I did some profiling on whether batching gives a significant speed gain, and on the GPU memory overhead, testing on an M3 Max and a few small Nvidia GPUs. On small consumer-grade GPUs the speed gain was not large. Happy to share the numbers if useful.

@ngxson ngxson marked this pull request as ready for review June 11, 2026 17:33
@ngxson ngxson requested review from a team as code owners June 11, 2026 17:33
@ngxson

ngxson commented Jun 11, 2026

Collaborator Author

@sfallah yes I'd appreciate if you can adapt the deepseek-ocr model similar to gemma4v.cpp in this PR

some important notes:

  • I'm testing with batching up to 9 images in one encoder pass. on my macbook m5, I see almost no gain in performance
  • only images (or tiles) with the same size can be batched together; IIRC, ds-ocr v2 has a bigger overview image, but I could be wrong
  • the batch size is conditioned by the number of output tokens, so it's expected that some tiles cannot be processed in the same batch as the others. this is to prevent users from complaining that mmproj uses excessive memory, but we can gain some more space (and raise the batch size) if my other PR is accepted: mtmd, llama: shared backend sched #24361

@sfallah

sfallah commented Jun 12, 2026

Contributor

@ngxson

  • only images (or tiles) with the same size can be batched together; IIRC, ds-ocr v2 has a bigger overview image, but I could be wrong

yes, in fact both ds-ocr versions have bigger (1024x1024) overview images.

  • the batch size is conditioned by the number of output tokens, so it's expected that some tiles cannot be processed in the same batch as the others. this is to prevent users from complaining that mmproj uses excessive memory

ds-ocr v2 doesn't have any issue with this, like most llava uhd style (tile slicing) models I guess.
But ds-ocr v1 concats image_newline to every token row across the whole grid width (a row mixes tokens from all tiles in that grid row), i.e. we can at minimum encode a tile grid row at a time.

@ngxson

ngxson commented Jun 12, 2026

Collaborator Author

ds-ocr v2 doesn't have any issue with this, like most llava uhd style (tile slicing) models I guess.
But ds-ocr v1 concats image_newline to every token row across the whole grid width (a row mixes tokens from all tiles in that grid row), i.e. we can at minimum encode a tile grid row at a time.

no, I think you misunderstood my point:

any llava-uhd style model slices the input image into multiple smaller (always square) images. for example, one big image can be sliced into 9 tiles.

without batching, the cgraph only needs enough memory to hold 1 image in a single decode pass. a batch of 9 images means you now need 9x memory, which can be too large, so the batch will be conditioned by the number of output tokens; it's not the most reliable way, but roughly correct. so for example, if the batch is conditioned to max 6 tiles, the image will be processed in 2 batches: 6 + 3

the main point is that this limit is expected and should always be respected by all models
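
The 6 + 3 split can be reproduced with a small token-budget sketch. `split_by_token_budget` is a hypothetical helper, not code from this PR, and the numbers in the usage note (256 output tokens per tile, a budget of 1536 tokens) are assumptions chosen so that exactly 6 tiles fit. The rule that the first tile of a batch is always accepted, even when it alone exceeds the budget, follows the "first image will always be added" behavior mentioned elsewhere in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits tiles into consecutive batches so that the summed output tokens of a
// batch stay within max_tokens. The first tile of every batch is accepted
// unconditionally, so an oversized tile still gets encoded on its own.
// Returns the number of tiles in each batch.
std::vector<std::size_t> split_by_token_budget(const std::vector<std::size_t> & tokens_per_tile,
                                               std::size_t max_tokens) {
    std::vector<std::size_t> batch_sizes;
    std::size_t in_batch = 0; // tiles in the batch being filled
    std::size_t used     = 0; // output tokens already claimed by that batch
    for (std::size_t n : tokens_per_tile) {
        assert(n > 0);
        if (in_batch > 0 && used + n > max_tokens) {
            batch_sizes.push_back(in_batch); // close the full batch
            in_batch = 0;
            used     = 0;
        }
        ++in_batch;
        used += n;
    }
    if (in_batch > 0) {
        batch_sizes.push_back(in_batch);
    }
    return batch_sizes;
}
```

For 9 tiles of 256 tokens each and a budget of 1536 tokens, the helper returns batches of 6 and 3 tiles. With a budget smaller than a single tile, every tile ends up in its own batch instead of being rejected.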

@sfallah

sfallah commented Jun 12, 2026

Contributor

@ngxson
no misunderstanding.
I think I already knew what you meant.
I tried to explain that we have this extra constraint with ds-ocr v1: we can't encode tile-wise, we can at minimum encode a tile-grid row at a time.
So if we have, let's say, grid_x=2, grid_y=4 (a 2x4 grid; max-tiles is 9), the min batch size is 2 tiles (one grid row).
I.e. the n_tokens of one grid row (the two tiles plus the woven newlines) should fit under batch_max_tokens. And if the limit is smaller than even one row, the row gets encoded anyway - same soft behavior as your "first image will always be added" rule.

I have it almost ready, I will create a DRAFT PR so you can see it in the code.

@ngxson

ngxson commented Jun 12, 2026

Collaborator Author

I tried to explain that we have this extra constraint with ds-ocr v1: we can't encode tile-wise, we can at minimum encode a tile-grid row at a time.

I don't see why we can't. you are assuming that all images in the batch must have the same number of output tokens, but that is not the case.

the batching system is flexible: the number of output tokens can be different for each image in the batch. that means even if one image in the batch has a newline and the rest don't, there is no problem at all.

assuming that a whole row needs to be encoded makes the logic model-specific. there are always cases where you can absolutely fit multiple rows in the same batch (i.e. the user simply allows a larger batch)

all you need to do is to insert the newline to the correct index in the output, it can be done simply by having a loop to concat view 3d output [n_embd, n_tokens, n_batch] as slices of [n_embd, n_tokens], then concat them back while inserting a newline conditionally

and that even work if the batch is not row-aligned, for example output can be: [tile, tile, newline, tile, tile]
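
The loop described above can be sketched on the host side as follows. This is only an illustration of that description at whole-tile granularity, under stated assumptions: `assemble_with_newlines` is a hypothetical helper, the batched output is modeled as a flat float array laid out as [n_embd, n_tokens, n_batch] with n_embd varying fastest, and the caller decides per tile whether a newline follows it. It does not use ggml and says nothing about which layout a particular model expects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flattens a batched encoder output [n_embd, n_tokens, n_batch] into a single
// token sequence. After tile b, the newline embedding (n_embd floats) is
// appended when newline_after[b] is true.
std::vector<float> assemble_with_newlines(const std::vector<float> & out,
                                          std::size_t n_embd,
                                          std::size_t n_tokens,
                                          const std::vector<bool> & newline_after,
                                          const std::vector<float> & newline) {
    const std::size_t n_batch = newline_after.size();
    const std::size_t slice   = n_embd * n_tokens; // floats per tile
    assert(out.size() == slice * n_batch);
    assert(newline.size() == n_embd);

    std::vector<float> result;
    for (std::size_t b = 0; b < n_batch; ++b) {
        const float * tile = out.data() + b * slice; // view of tile b
        result.insert(result.end(), tile, tile + slice);
        if (newline_after[b]) {
            result.insert(result.end(), newline.begin(), newline.end());
        }
    }
    return result;
}
```

Passing four tiles with `newline_after = {false, true, false, false}` yields the [tile, tile, newline, tile, tile] order from the example above, which shows that the batch does not have to be row-aligned.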

@sfallah

sfallah commented Jun 12, 2026

Contributor

all you need to do is to insert the newline to the correct index in the output, it can be done simply by having a loop to concat view 3d output [n_embd, n_tokens, n_batch] as slices of [n_embd, n_tokens], then concat them back while inserting a newline conditionally

That is how my first implementation of ds-ocr dynamic resolution worked, see encode_deepseekocr_v1:
https://github.com/sfallah/llama.cpp/blob/sf/deepseek-ocr-mul-tile-dyn-res/tools/mtmd/mtmd.cpp#L136-L178
Tiles are encoded independently, newlines are inserted afterwards (host-side there). I linked this branch in the description of #24300 and explained why I moved away from it:

I have prepared a sequential (non-batched) alternative (see sf/deepseek-ocr-mul-tile-dyn-res) that I consider to be a hack. That is why I followed this path more seriously.

To be precise about the layout: a tile's tokens are not contiguous in the final output, so the loop has to interleave tile rows, not concat whole tiles.

And that was exactly my problem: it is ugly, model-specific and lives in mtmd. "Insert the newline to the correct index" is the model-specific part; the loop that knows the indices has to live somewhere. So my question to you: where would you put this weaving/assembly so that it stays clean and model-agnostic?

I have both variants working now (in-graph weave with row-aligned batches, flat tile batches with assembly-time weave); identical OCR output either way, including non-row-aligned splits. Draft PR coming so you can see it in code.

@ngxson

ngxson commented Jun 12, 2026

Collaborator Author

please correct if I'm wrong, but let's take the non-batching version as the ground truth:

for ds-ocr-v1:

cur = ggml_concat(ctx0, cur, model.view_seperator, 1); // (n_dim, h*(w+1) + 1)

the view_seperator is always appended unconditionally to the tile, so I imagine the output will be: [tile, view_seperator, tile, view_seperator, tile, view_seperator, ...]

on the batched version, you can do that by simply concat the view_seperator to 2nd dim, it will be broadcasted to the 3rd dim (batch dim), so any other problems with it? just a simple ggml_concat()

for v2:

// view_seperator only after the global view
if (img.add_viewsep) {
    cur = ggml_concat(ctx0, cur, model.view_seperator, 1); // (n_dim, 257)
}

the view_sep is only added for the overview image, which won't be batched anyway (because it's bigger than the tiles), so I think we don't even need to do a loop to assemble it. upon encoding the overview image, we can simply add the view_seperator to all images in the batch (because other images should also be the overview, they are all the same size)

To be precise about the layout: a tile's tokens are not contiguous in the final output, so the loop has to interleave tile rows, not concat whole tiles.

why they aren't contiguous? IIUC output is [n_embd, n_tokens_per_image, n_batch], so they should follow the same order as the input

@ngxson

This comment was marked as outdated.

@sfallah

sfallah commented Jun 12, 2026

Contributor

I think we are mixing two different things here:

  1. Batching multiple complete input images. Every encoded input image produces an output block of the same shape (for single-view v1 always: global view rows + one newline per row + one view_seperator at the end), so any number of them can be stacked -- view_seperator included, one per image. No problem there, and not what I am talking about.
  2. Encoding the tiles of ONE multi-tile input image. Here the tiles share one image_newline weave that spans across the tiles. This is the only hard part, and it is about image_newline, not view_seperator.

why they aren't contiguous? IIUC output is [n_embd, n_tokens_per_image, n_batch], so they should follow the same order as the input

Because the HF reference rearranges the tile features into the full image grid before inserting the newlines, see modeling_deepseekocr.py:
https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/modeling_deepseekocr.py

local_features = local_features.view(height_crop_num, width_crop_num, h2, w2, n_dim2).permute(0, 2, 1, 3, 4).reshape(height_crop_num*h2, width_crop_num*w2, n_dim2)
local_features = torch.cat(
    [local_features, self.image_newline[None, None, :].expand(height_crop_num * h2, 1, n_dim2)], dim=1
)
...
global_local_features = torch.cat([local_features, global_features, self.view_seperator[None, :]], dim=0)

The permute(0, 2, 1, 3, 4) swaps the tile-column axis with the row-within-tile axis: the result is the stitched image as one grid of height_crop_num*h2 rows, and image_newline goes after each of these full-width rows. So a tile's tokens are spread over h2 different rows of the final output. And view_seperator appears exactly once per input image, at the very end after the global view (last line above).
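
The effect of that view/permute/reshape on token order can be checked with a small index-level sketch. It is a simplification, not the ggml or HF code: `weave_tiles` is a hypothetical helper that moves integer token ids instead of embedding vectors, assumes tiles arrive in row-major grid order, and uses `grid_x`/`grid_y` for `width_crop_num`/`height_crop_num`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// tiles: grid_y * grid_x tiles in row-major grid order, each holding h2 * w2
// token ids in row-major order. Returns the stitched sequence: for every
// full-width row of the grid, that row's tokens from each tile left to right,
// followed by one newline id.
std::vector<int> weave_tiles(const std::vector<std::vector<int>> & tiles,
                             std::size_t grid_x, std::size_t grid_y,
                             std::size_t h2, std::size_t w2,
                             int newline_id) {
    assert(tiles.size() == grid_x * grid_y);
    std::vector<int> out;
    out.reserve(grid_y * h2 * (grid_x * w2 + 1));
    for (std::size_t gy = 0; gy < grid_y; ++gy) {       // tile row in the grid
        for (std::size_t r = 0; r < h2; ++r) {          // token row inside a tile
            for (std::size_t gx = 0; gx < grid_x; ++gx) { // tile column
                const std::vector<int> & t = tiles[gy * grid_x + gx];
                assert(t.size() == h2 * w2);
                for (std::size_t c = 0; c < w2; ++c) {  // token column inside a tile
                    out.push_back(t[r * w2 + c]);
                }
            }
            out.push_back(newline_id); // one newline per full-width row
        }
    }
    return out;
}
```

For two 2x2 tiles side by side (tokens 0..3 and 4..7) and newline id -1, the result is 0, 1, 4, 5, -1, 2, 3, 6, 7, -1: each tile's tokens land in two different output rows, so they are not contiguous in the final sequence.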

I am doing the same in ggml in my batched-encode branch (the #24300 one), see:
https://github.com/sfallah/llama.cpp/blob/sf/dsocr-mul-tile-batched-encode/tools/mtmd/models/deepseekocr.cpp#L315-L324

cur = ggml_reshape_4d(ctx0, cur, n_dim * tile_w, tile_w, grid_x, grid_y); // [n_dim*tile_w, tile_w, grid_x, grid_y]
cur = ggml_cont(ctx0, ggml_permute(ctx0, cur, 0, 2, 1, 3));
...
nl  = ggml_repeat_4d(ctx0, model.image_newline, n_dim, 1, gh, 1);
cur = ggml_reshape_3d(ctx0, cur, n_dim, gw, gh); //[n_dim, gw, gh]
cur = ggml_concat(ctx0, cur, nl, 1);

Also, master cannot be the ground truth for multi-tile: master's ds-ocr v1 is single-view only, the multi-tile path is what #24300 added. The unconditional view_seperator at the line you linked is correct there because the only image master ever encodes is the global view -- that is exactly why my branch gates it on add_viewsep.

For v2 I agree on the layout: tiles have no newline weave, their raw concatenation is already the final output, and view_seperator only follows the overview -- that is what my implementation does.

My implementation of exactly this layout scores CER 0.0000 against the HF output on the multi-tile eval -- with a tile-contiguous layout that match would not be possible.

@ngxson

ngxson commented Jun 12, 2026

Collaborator Author

Also, master cannot be the ground truth for multi-tile: master's ds-ocr v1 is single-view only, the multi-tile path is what #24300 added.

tbh I'm not a fan of introducing 2 changes in one PR. this is the exact root cause of the miscommunication in the past N messages between us. if you have 2 different changes, please make it very clear.

what I understand is that there are 2 different subjects:

  1. for v1, you want to add multi-tile path --> it can still work without batching (independently), correct?
  2. for v2 (and v1-multi-tile), you want multiple tiles to be processed in one batch, correct?

for v1-multi-tile, please push a PR without batching support first. I will not proceed until I understand what it does.

for v2-batching, it should be the same case as the existing llava-uhd models --> no significant problem, right?

@sfallah

sfallah commented Jun 12, 2026

Contributor

what I understand is that there are 2 different subjects:

  1. for v1, you want to add multi-tile path --> it can still work without batching (independently), correct?
  2. for v2 (and v1-multi-tile), you want multiple tiles to be processed in one batch, correct?

Correct on both.

  1. v1 multi-tile works without your batching API. I will push it as a standalone PR against master so it can be reviewed on its own. For the weave I will use your method 1, since you prefer it -- I have that variant implemented and validated.
  2. Batching the tiles is then a thin layer on top of the standalone PR, and yes, for v2 it is the same case as llava-uhd, no significant problem.

Agreed on splitting the PRs.

@sfallah

sfallah commented Jun 12, 2026

Contributor

@ngxson

BTW the PR that you closed rather abruptly (#24300) already included functionally everything for a proper DSOCR v1+v2 dynamic-resolution multi-tile batched encoding of tiles in parity with HF reference impls -- carefully crafted, with a solid regression test, perf testing and profiling.

The non-batched PR you are asking for is essentially my dyn-res branch (https://github.com/sfallah/llama.cpp/tree/sf/deepseek-ocr-mul-tile-dyn-res); the batched layer on top is exactly what #24300 did.

But I understand your point about splitting the PRs, so I will do that. I just want to make sure we are on the same page about the content of each PR and the implications of the different approaches. Concretely: method 1 (the in-graph weave) needs the whole grid in one graph, so the non-batched first PR will follow the dyn-res approach; the in-graph weave then belongs to the batched layer.

@ngxson

ngxson commented Jun 12, 2026

Collaborator Author

The non-batched PR you are asking for is essentially my dyn-res branch (https://github.com/sfallah/llama.cpp/tree/sf/deepseek-ocr-mul-tile-dyn-res); the batched layer on top is exactly what #24300 did.

I might be a bit hard here, but the value of open source contribution is not only about "the code works", but also about planning and communication. you cannot expect to push a large PR that contains multiple (unrelated) changes and have someone else fully understand it; that's not how code review works.

your own comment #24300 (comment) also pointed out independent changes that can be split into smaller PRs, why don't we do that instead? not necessarily 4 separate PRs, but you get the idea. I did acknowledge the first one, #24352, and as proof: the review for that PR was straightforward.

Concretely: method 1 (the in-graph weave) needs the whole grid in one graph, so the non-batched first PR will follow the dyn-res approach; the in-graph weave then belongs to the batched layer.

I don't quite understand your intent here, but to make it clear: I expect the first version that simply doesn't use the 4th dim (n_batch); that dim should always be 1 in the cgraph

I imagine such change will affect just 2 places:

  • preprocessor of ds-ocr-v1
  • cgraph of ds-ocr-v1, to conditionally add view_seperator, similar to the if (img.add_viewsep) on v2

@ngxson

ngxson commented Jun 12, 2026

Collaborator Author

merging this change after CI passes; tests.sh also passed:

[vision] OK:   ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   ggml-org/LFM2-VL-450M-GGUF:Q8_0
[vision] OK:   ggml-org/granite-docling-258M-GGUF:Q8_0
[vision] OK:   ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0
[vision] OK:   ggml-org/DeepSeek-OCR-GGUF:Q8_0
[vision] OK:   ggml-org/dots.ocr-GGUF:Q8_0
[vision] OK:   ggml-org/HunyuanOCR-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-4-E2B-it-GGUF:Q8_0
[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[audio]  OK:   ggml-org/LFM2-Audio-1.5B-GGUF:Q8_0
[audio]  OK:   ggml-org/gemma-4-E2B-it-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen3-ASR-0.6B-GGUF:Q8_0

@ngxson ngxson merged commit e37abd6 into ggml-org:master Jun 12, 2026
25 checks passed
Development

Successfully merging this pull request may close these issues.

Avoid re-encoding mtmd chunk when prefill MTP context

2 participants